Skip to content

fix: when setting fsdp size unuse megatron for gather in npu#185

Merged
tastelikefeet merged 2 commits intomodelscope:mainfrom
0hujun:fsdp2+cp
Apr 24, 2026
Merged

fix: when setting fsdp size unuse megatron for gather in npu#185
tastelikefeet merged 2 commits intomodelscope:mainfrom
0hujun:fsdp2+cp

Conversation

@0hujun
Copy link
Copy Markdown
Contributor

@0hujun 0hujun commented Apr 23, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Fix issue #184 .
When using Ascend NPU for FSDP2 + CP train, got not init error cause import megatron package and using it to get dp group that is not initialized.
This error may be introduced from this pr #153 .
Because using FSDP2 will set fsdp_size and using Megatron will not set fsdp_size, so just check fsdp whether in DeviceMesh , and add it in if code line.

Experiment results

Works fine.

[2026-04-23 10:57:54][INFO:twinkle] Current is step 0 of 125, metric: {'loss': '3.1660', 'accuracy': '0.48', 'correct_tokens': 189, 'total_tokens': 392, 'learning rate(param group 1)': '0.000000e+00', 'learning rate(param group 2)': '0.000000e+00', 'iters': 0, 'total time elapse': '11 seconds', 'speed': '0.00 iters/s'}
[2026-04-23 10:58:17][INFO:twinkle] Current is step 20 of 125, metric: {'loss': '2.6121', 'grad_norm': '4.062500', 'accuracy': '0.54', 'correct_tokens': 3275, 'total_tokens': 6050, 'learning rate(param group 1)': '9.957224e-05', 'learning rate(param group 2)': '9.957224e-05', 'iters': 10, 'total time elapse': '33 seconds', 'speed': '0.45 iters/s'}
[2026-04-23 10:58:35][INFO:twinkle] Current is step 40 of 125, metric: {'loss': '1.4322', 'grad_norm': '3.062500', 'accuracy': '0.67', 'correct_tokens': 4335, 'total_tokens': 6455, 'learning rate(param group 1)': '9.619398e-05', 'learning rate(param group 2)': '9.619398e-05', 'iters': 20, 'total time elapse': '52 seconds', 'speed': '0.53 iters/s'}
[2026-04-23 10:58:54][INFO:twinkle] Current is step 60 of 125, metric: {'loss': '0.9267', 'grad_norm': '3.781250', 'accuracy': '0.75', 'correct_tokens': 4581, 'total_tokens': 6087, 'learning rate(param group 1)': '8.966767e-05', 'learning rate(param group 2)': '8.966767e-05', 'iters': 30, 'total time elapse': '70 seconds', 'speed': '0.55 iters/s'}
[2026-04-23 10:59:12][INFO:twinkle] Current is step 80 of 125, metric: {'loss': '0.6422', 'grad_norm': '2.062500', 'accuracy': '0.81', 'correct_tokens': 5232, 'total_tokens': 6426, 'learning rate(param group 1)': '8.043807e-05', 'learning rate(param group 2)': '8.043807e-05', 'iters': 40, 'total time elapse': '89 seconds', 'speed': '0.53 iters/s'}
[2026-04-23 10:59:31][INFO:twinkle] Current is step 100 of 125, metric: {'loss': '0.4760', 'grad_norm': '2.328125', 'accuracy': '0.86', 'correct_tokens': 5322, 'total_tokens': 6201, 'learning rate(param group 1)': '6.913417e-05', 'learning rate(param group 2)': '6.913417e-05', 'iters': 50, 'total time elapse': '107 seconds', 'speed': '0.55 iters/s'}
[2026-04-23 10:59:49][INFO:twinkle] Current is step 120 of 125, metric: {'loss': '0.3139', 'grad_norm': '2.250000', 'accuracy': '0.91', 'correct_tokens': 5808, 'total_tokens': 6411, 'learning rate(param group 1)': '5.652631e-05', 'learning rate(param group 2)': '5.652631e-05', 'iters': 60, 'total time elapse': '126 seconds', 'speed': '0.53 iters/s'}

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the gather_object function in src/twinkle/utils/framework.py to conditionally bypass a specific NPU workaround when 'fsdp' is present in the device mesh. A review comment points out a potential TypeError when accessing mesh_dim_names directly and suggests using the has_dim method instead to safely check for the dimension.

Comment thread src/twinkle/utils/framework.py Outdated
@tastelikefeet tastelikefeet merged commit 061903e into modelscope:main Apr 24, 2026
1 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants